(Explain why you chose this topic, and the questions you are interested in studying. List team members and a description of how each contributed to the project.)
NYC is a hub of diversity, yet there is one similarity that 8.5 million people shares – everyone eats. Over 26,000 eating establishments in the city (according to the NYC Department of Health and Mental Hygiene or DOHMH) thrives here because New Yorkers dine out over 58% of their lunches or dinners and spend $46.14 on average for each meal per person (Zagat Dining Trends Survey 2018). For these customers, ourselves included, the food that we purchase so frequently and at such high costs is a major factor influencing our health. To help people stay informed about food safety and the potential health risks that eating out poses, we conducted our research about restaurants in NYC based on DOHMH’s source data hosted on Open Data.
Cindy- data description, data quality, time series visualization Selina- restaurant case Lisa- violation distribution and analysis Julie-
Our dataset is collected from NYC Open Data along with a data description provided by the website
375,989 rows of data entries of NYC restaurant inspection information between June 2012 - April 2018
Each row consists of 18 variables including (and not limited to) restaurant information (name, address, zipcode, borough, cuisine type), inspection results (inspection type, inspection date), violation details (violation code with corresponding violation description), and scorings (score, letter grade). For the sole purpose of efficiently visualizing inspection results, we only selected the most relevant and most representative 10 variables in the dataset to conduct further exploration. The ten vairbales this report focuses on are - DBA (restaurant name), borough, zipcode, cuisine description,inspection date,violation code, violation description, score, grade, and inspection type
Since restaurants go in and out of business, the dataset only records restaurants that are still active in March 2018 (last inspection included in the data)
click here to see interactive visna plots made with D3
We first explore the distribution of missing data in the original dataset. As shown in the chart and plot, the most common missing pattern in the original data is Grade on its own with over 50% missing data, followed by only around 5% in scores and minor missing percentages in violation details.
## # A tibble: 10 x 4
## variable type value percent_missing
## <chr> <chr> <dbl> <dbl>
## 1 GRADE character 188585 50.2
## 2 SCORE integer 20466 5.45
## 3 VIOLATION.DESCRIPTION character 7106 1.89
## 4 VIOLATION.CODE character 6010 1.60
## 5 INSPECTION.TYPE character 1108 0.295
## 6 DBA character 419 0.112
## 7 BORO character 0 0
## 8 ZIPCODE character 0 0
## 9 CUISINE.DESCRIPTION character 0 0
## 10 INSPECTION.DATE character 0 0
From the official grading system in data description, we learned that the grades are converted from scores accroding to a specific scheme for every ‘gradable’ inspection, and only inspections of certain types are eligble to receive a grade. Due to the large amount of missing values in Grade, we pay more attention to scores and violation details since they are good and direct reflections of inspection results and contain much fewer missing data.
In addition to grades, the data description also states that some scores are missing because there are new restarants that have yet to be inspected, and they are marked by “inspection.date = 01/01/1900”. Therefore, we removed rows with new restaurants’ and looked at missing values again.
The second chart shows the missing value percentages for each variable after removing new restaurants. We see a decrease in missing scores and for our exploratory analysis, we removed all missing values in scores.
## [1] "Number of new restaurants: 1108"
## # A tibble: 10 x 4
## variable type value percent_missing
## <chr> <chr> <dbl> <dbl>
## 1 GRADE character 187478 50.1
## 2 SCORE integer 19358 5.17
## 3 VIOLATION.DESCRIPTION character 5998 1.60
## 4 VIOLATION.CODE character 4902 1.31
## 5 DBA character 0 0
## 6 BORO character 0 0
## 7 ZIPCODE character 0 0
## 8 CUISINE.DESCRIPTION character 0 0
## 9 INSPECTION.DATE character 0 0
## 10 INSPECTION.TYPE character 0 0
The third chart summarizes the final data we used for the report. Although there are still many missing values in grades, we do not consider it bad data quality since they are intentionally left blank.The overall quality of the dataset seems quite promising at this point.
## # A tibble: 10 x 4
## variable type value percent_missing
## <chr> <chr> <dbl> <dbl>
## 1 GRADE character 168130 47.4
## 2 VIOLATION.DESCRIPTION character 1934 0.545
## 3 VIOLATION.CODE character 1492 0.420
## 4 DBA character 0 0
## 5 BORO character 0 0
## 6 ZIPCODE character 0 0
## 7 CUISINE.DESCRIPTION character 0 0
## 8 INSPECTION.DATE character 0 0
## 9 SCORE integer 0 0
## 10 INSPECTION.TYPE character 0 0
To ensure data quality and validify our further analysis, we inspect the data more closely based on two variables we are particularly interested in- boroughs and inspection types.
# By borough
percent_missing1 <- df_1 %>% group_by(BORO) %>%
summarize(num_Restaurants = n(), num_Missing_Score = sum(is.na(`SCORE`))) %>%
mutate(percent_Missing_Score = round(num_Missing_Score/num_Restaurants, 2)) %>%
arrange(-percent_Missing_Score)
p1<- ggplot(percent_missing1)+
geom_line(aes(x= BORO, y =percent_Missing_Score, group=1))+
geom_point(aes(x= BORO, y =percent_Missing_Score),color = "hotpink")+
ylim(0,0.15)+
ggtitle("Missing data percentage by borough")
p1 <- ggplotly(p1)
p1
# By inspection type
percent_missing <- df_1 %>% group_by(INSPECTION.TYPE) %>%
summarize(num_Restaurants = n(), num_Missing_Score = sum(is.na(`SCORE`))) %>%
mutate(percent_Missing_Score = round(num_Missing_Score/num_Restaurants, 2)) %>%
arrange(-percent_Missing_Score)
p2 <- ggplot(percent_missing)+
geom_line(aes(x= INSPECTION.TYPE, y =percent_Missing_Score, group=1))+
geom_point(aes(x= INSPECTION.TYPE, y =percent_Missing_Score),color = "blue")+
ggtitle("Missing data percentage by Inspection Type")
p2 <- ggplotly(p2)
p2
Each borough has very similar percentage in missing values so it is fair to conclude that the data qulity of our dataset is identical across the city.
(Provide a detailed, well-organized description of your findings, including textual description, graphs, and code. Your focus should be on both the results and the process. Include, as reasonable and relevant, approaches that didn’t work, challenges, the data cleaning process, etc.)
The following time series plot shows the change in quarter average score for the past five year. There is a clear pattern that the third quarter (Jul - Sept) has the highest average score each year. Thinking in terms of season and weather, since July, August and September are typically the hottest months in NYC, we can infer that there are more violations during the summer.
df$INSPECTION.DATE <- as.Date(as.character(df$INSPECTION.DATE),format="%m/%d/%Y")
ts <- xts(df$SCORE,df$INSPECTION.DATE)
quarterly <- apply.quarterly(ts, FUN =mean)
ts_df<- data.frame(date=index(ts),score = ts)
quarterly_df <- data.frame(date = index(quarterly), quarter_avg = quarterly)
p<-plot_ly(quarterly_df[quarterly_df$date> "2013-05-16" & quarterly_df$date < "2018-03-31" ,])%>%
add_lines( x=~date, y=~quarter_avg,
type = 'scatter',
mode = 'lines+markers',
hoverinfo = 'text',
text = ~paste("Quarter Date:",date,"<br> Average Score: ",round(quarter_avg)))%>%
add_trace(x = ~date,y = ~quarter_avg, mode = 'markers',color=I("hotpink"),marker = list(size = 8))%>%
layout(title = "Quarter Average Score",showlegend = FALSE)
p
In the following sections, we will dive deeper into different violations, scores, and some of our most frequented restaurants.
## [1] 354841
convert_to_grade <- function(x){
if (x < 14){
return("A")
}
else if(x > 28){
return("C")
}
else {
return("B")
}
}
grade <- sapply(df$score, convert_to_grade)
df$grade <- grade
Combining the cuisine descriptions
df$cuisine <- factor(df$cuisine)
levels(df$cuisine) <- sub("Pizza/Italian", "Italian", levels(df$cuisine))
levels(df$cuisine) <- sub("Pizza", "Italian", levels(df$cuisine))
levels(df$cuisine) <- sub("Café/Coffee/Tea", "Dessert", levels(df$cuisine))
levels(df$cuisine) <- sub("Fruits/Vegetables", "Salads",levels(df$cuisine))
levels(df$cuisine) <- sub("Hotdogs/Pretzels", "Hotdogs", levels(df$cuisine))
levels(df$cuisine) <- sub("Ice Cream, Gelato, Yogurt, Ices", "Dessert", levels(df$cuisine))
levels(df$cuisine) <- sub("Juice, Smoothies, Fruit Salads", "Salads", levels(df$cuisine))
levels(df$cuisine) <- sub("Latin (Cuban, Dominican, Puerto Rican, South & Central American)",
"Latin", levels(df$cuisine))
levels(df$cuisine) <- sub("Sandwiches/Salads/Mixed Buffet", "Sandwiches", levels(df$cuisine))
levels(df$cuisine) <- sub("Bottled beverages, including water, sodas, juices, etc.", "Dessert", levels(df$cuisine))
levels(df$cuisine) <- sub("Not Listed/Not Applicable", "Other", levels(df$cuisine))
levels(df$cuisine) <- sub("Latin (Cuban, Dominican, Puerto Rican, South & Central American)", "Latin", levels(df$cuisine))
levels(df$cuisine) <- sub("Soups & Sandwiches", "Sandwiches", levels(df$cuisine))
Because some of the restaurant types are too specific, we decide to combine some of the categories to make it more simple and interpretable. As a result, all the restaurants with restaurant types that include word Pizza are categorized as Italian restaurants; all the cafes, stores that sell ice cream, drinks are categorized as Dessert; the restaurants with the restaurant type of not listed/not applicable are categorized as others, etc.
df$score <- as.numeric(df$score)
ggplot(df, aes(score)) + geom_histogram(binwidth = 5, boundary = 0, color = "black", fill = "blue") + labs(title = "Distribution of scores")
ggplot(df, aes(score)) + geom_histogram(binwidth = 5, boundary = 0, color = "black", fill = "blue") + labs(title = "Distribution of scores based on borough") +
facet_grid(boro~., scale = "free")
The plot shown above describes the overall pattern of the scores. It is shown that the distribution of scores is skewed to the right. While it does not look reasonable to have a negative score, some of the restaurants receive the negative score (This is why the histogram doesn’t start from 0). The scores that the restaurants mostly received for any type of violations are between 10 to 20. Also, there exist quite a lot of restaurants that receive a score higher than 25 for any type of violations. Because the score above 25 is associated with grade C, which is bad, we could probably consider such restaurants as not cleaned and not well maintained.
Also, we plot the distribution of scores based on boroughs to explore the relationship between borough and scores. Because we are interested in the distribution for each borough, we set the scale to free to make sure the distributions for some boroughs do not shrink. While there is a slight change in the distribution of scores across boroughs, they are all highly skewed to the right, and the shape of the distributions looks almost the same. This shows that there is no or little relationship between the boroughs and the scores.
# grade distribution for each violation code
violation_df <- df %>% select(violation_code, grade) %>% group_by(violation_code, grade) %>% summarize(count = n())
violation_df <- violation_df %>% filter(!is.na(violation_code))
violation_df = df %>% select(violation_code) %>% group_by(violation_code) %>% summarize(count = n()) %>% arrange(-count)
ggplot(violation_df, aes(reorder(violation_code, count), count)) + geom_col() + coord_flip() +
labs(title = "Distribution of violation code", x = "violation code")
violation_df_top10 <- violation_df[1:10, ]
violation_des <- violation[violation$VIOLATION.CODE %in% violation_df_top10$violation_code, ]
violation_des$VIOLATION.CODE <- factor(violation_des$VIOLATION.CODE, levels = violation_df_top10$violation_code)
violation_des <- violation_des %>% arrange(VIOLATION.CODE)
print(violation_des)
## VIOLATION.CODE
## 1 10F
## 2 08A
## 3 04L
## 4 06C
## 5 06D
## 6 02G
## 7 10B
## 8 02B
## 9 04N
## 10 04H
## VIOLATION.DESCRIPTION
## 1 Non-food contact surface improperly constructed. Unacceptable material used. Non-food contact surface or equipment improperly maintained and/or not properly sealed, raised, spaced or movable to allow accessibility for cleaning on all sides, above and underneath the unit.
## 2 Facility not vermin proof. Harborage or conditions conducive to attracting vermin to the premises and/or allowing vermin to exist.
## 3 Evidence of mice or live mice present in facility's food and/or non-food areas.
## 4 Food not protected from potential source of contamination during storage, preparation, transportation, display or service.
## 5 Food contact surface not properly washed, rinsed and sanitized after each use and following any activity when contamination may have occurred.
## 6 Cold food item held above 41Ã\u0082º F (smoked fish and reduced oxygen packaged foods above 38 Ã\u0082ºF) except during necessary preparation.
## 7 Plumbing not properly installed or maintained; anti-siphonage or backflow prevention device not provided where required; equipment or floor not properly drained; sewage disposal system in disrepair or not functioning properly.
## 8 Hot food item not held at or above 140Ã\u0082º F.
## 9 Filth flies or food/refuse/sewage-associated (FRSA) flies present in facility\032s food and/or non-food areas. Filth flies include house flies, little house flies, blow flies, bottle flies and flesh flies. Food/refuse/sewage-associated flies include fruit flies, drain flies and Phorid flies.
## 10 Raw, cooked or prepared food is adulterated, contaminated, cross-contaminated, or not discarded in accordance with HACCP plan.
ggplot(violation_df_top10, aes(reorder(violation_code, count), count)) + geom_col() + coord_flip() +
labs(title = "Distribution of top 10 violation code", x = "violation code")
The most commonly violated violation type is 10F, ‘Non-food contact surface improperly constructed’. While this violation seems not too serious, the 08A and 04L are associated with vermin and mice, which sound more disgusting and serious. Moreover, 06C is the violation code that is ‘Food not protected from potential source of contamination during storage, preparation, transportation, display or service.’ Learning that there are restaurants in New York which have issue with vermin and mice while not properly storing and protecting foods from any potential source of contamination may suggest us to be very careful and picky when deciding which restaurant to go.
#getting the number of restaurants for each restaurant type
distinct_res <- df %>% select(cuisine, name) %>% distinct(cuisine, name) %>% mutate(count = 1)
distinct_res <- distinct_res %>% select(cuisine, count) %>% group_by(cuisine) %>% summarise(total_num = sum(count)) %>% arrange(-total_num)
ggplot(distinct_res, aes(reorder(cuisine, total_num), total_num)) + geom_col() + coord_flip() +
labs(title = "Number of restaurants", x = "Cuisine")
top5res <- distinct_res[1:5, ]
ggplot(top5res, aes(reorder(cuisine, total_num), total_num)) + geom_col() + coord_flip() +
labs(title = "Number of restaurants", x = "Cuisine")
The bar chart shows the number of restaurants for each restaurant type. Because there are more than 50 restaurant types in this data, we decide to focus on 5 most common restaurant types. The top 5 restaurant types are following: American, Italian, Chinese, Dessert, and Latin. As shown in the plot above, the most common restaurant type is American (5310), followed by Italian (2224) and Chinese (2088).
# getting the information associated with 5 most common restaurant types
top5 <- df[df$cuisine %in% top5res$cuisine, ]
top5 <- na.omit(top5)
top5$cuisine <- factor(top5$cuisine)
# group by violation code to see the most commonly violated violation codes
top5_violations = top5 %>% select(violation_code) %>% group_by(violation_code) %>% summarize(count = n()) %>% arrange(-count)
ggplot(top5_violations, aes(reorder(violation_code, count), count)) + geom_col() + coord_flip() +
labs(title = "Distribution of violation code", x = "violation code")
violations_10mostcommon <- top5_violations[1:10, ]
new_violation <- violation[violation$VIOLATION.CODE %in% violations_10mostcommon$violation_code, ]
new_violation$VIOLATION.CODE <- factor(new_violation$VIOLATION.CODE, levels = violations_10mostcommon$violation_code)
new_violation <- new_violation %>% arrange(VIOLATION.CODE)
print(new_violation)
## VIOLATION.CODE
## 1 10F
## 2 08A
## 3 06D
## 4 04L
## 5 06C
## 6 02G
## 7 10B
## 8 02B
## 9 04N
## 10 04H
## VIOLATION.DESCRIPTION
## 1 Non-food contact surface improperly constructed. Unacceptable material used. Non-food contact surface or equipment improperly maintained and/or not properly sealed, raised, spaced or movable to allow accessibility for cleaning on all sides, above and underneath the unit.
## 2 Facility not vermin proof. Harborage or conditions conducive to attracting vermin to the premises and/or allowing vermin to exist.
## 3 Food contact surface not properly washed, rinsed and sanitized after each use and following any activity when contamination may have occurred.
## 4 Evidence of mice or live mice present in facility's food and/or non-food areas.
## 5 Food not protected from potential source of contamination during storage, preparation, transportation, display or service.
## 6 Cold food item held above 41Ã\u0082º F (smoked fish and reduced oxygen packaged foods above 38 Ã\u0082ºF) except during necessary preparation.
## 7 Plumbing not properly installed or maintained; anti-siphonage or backflow prevention device not provided where required; equipment or floor not properly drained; sewage disposal system in disrepair or not functioning properly.
## 8 Hot food item not held at or above 140Ã\u0082º F.
## 9 Filth flies or food/refuse/sewage-associated (FRSA) flies present in facility\032s food and/or non-food areas. Filth flies include house flies, little house flies, blow flies, bottle flies and flesh flies. Food/refuse/sewage-associated flies include fruit flies, drain flies and Phorid flies.
## 10 Raw, cooked or prepared food is adulterated, contaminated, cross-contaminated, or not discarded in accordance with HACCP plan.
ggplot(violations_10mostcommon, aes(reorder(violation_code, count), count)) + geom_col() + coord_flip() +
labs(title = "Distribution of 10 most common violation code for 5 most common restaurant types", x = "violation code")
We only focus on the restaurants from 5 most common restaurant types to explore what type of violations the restaurants made the most. The above plot shows 10 most common violations made by these restaurants. The violation code that was most frequently appeared is 10F, which is ‘Non-food contact surface improperly constructed. Unacceptable material used’ While this violation does not sound very critical, 04L is associated with mice, which sounds more serious and critical. Out of 93 violations, this one has ranked at top 7 and is violated 4473 times. Because the above plot indicates the overall frequency of violations, we decide to explore the frequency of the violation codes for specific restaurant type. Again, as discussed above, we focus only on the violations made by restaurants from 5 most common restaurant types.
# this funciton takes the string (restaurant type) as an input and returns a bar chart and a dataframe that contains the information about the violation codes and the corresponding violation descriptions
violation_code_with_restype <- function(x){
new_df <- df %>% filter(cuisine == x) %>% select(violation_code) %>% group_by(violation_code) %>%
summarize(count = n()) %>% arrange(-count) %>% top_n(10)
new_violation <- violation[violation$VIOLATION.CODE %in% new_df$violation_code, ]
new_violation$VIOLATION.CODE <- factor(new_violation$VIOLATION.CODE, levels = new_df$violation_code)
new_violation <- new_violation %>% arrange(VIOLATION.CODE)
print(new_violation)
ggplot(new_df, aes(reorder(violation_code, count), count)) + geom_col() + coord_flip() +
labs(title = paste("Distribution of violation code of", x, "restaurants", sep = " "), x = "violation code")
}
We first focus on the violation types that American restaurants made. In order to produce a bar chart of 10 most common violation made by these restaurants, the steps we take are following: 1. we first filter out the rows with restaurant type = American 2. Then, we group the dataframe by the violation code and then count the number of each violation has made 3. we arrange the dataframe by the number of the occurrence and then extract top 10 violation codes to produce a bar chart The following steps are used to draw a plot for any restaurant type described below.
While the frequency of the violation code changes (because we are only looking at one specific type of the restaurants), the most commonly violated violation types stay almost the same. The 4 most common violation types are same as the plot shown before this one. Though, these restaurants seem to violate 02G, which associates with the storing cold food item above 41 Farenheit, more often than 10B, which associates with not properly installing or maintaining plumbing. Also, they tend to violate the rule associated with storing and maintaining foods as their 10 most violated code is ‘Raw, cooked or prepared food is adulterated, contaminated, cross-contaminated, or not discarded in accordance with HACCP plan’
violation_code_with_restype("Italian")
## Selecting by count
## VIOLATION.CODE
## 1 10F
## 2 08A
## 3 04L
## 4 06D
## 5 02G
## 6 02B
## 7 06C
## 8 10B
## 9 04N
## 10 04A
## VIOLATION.DESCRIPTION
## 1 Non-food contact surface improperly constructed. Unacceptable material used. Non-food contact surface or equipment improperly maintained and/or not properly sealed, raised, spaced or movable to allow accessibility for cleaning on all sides, above and underneath the unit.
## 2 Facility not vermin proof. Harborage or conditions conducive to attracting vermin to the premises and/or allowing vermin to exist.
## 3 Evidence of mice or live mice present in facility's food and/or non-food areas.
## 4 Food contact surface not properly washed, rinsed and sanitized after each use and following any activity when contamination may have occurred.
## 5 Cold food item held above 41Ã\u0082º F (smoked fish and reduced oxygen packaged foods above 38 Ã\u0082ºF) except during necessary preparation.
## 6 Hot food item not held at or above 140Ã\u0082º F.
## 7 Food not protected from potential source of contamination during storage, preparation, transportation, display or service.
## 8 Plumbing not properly installed or maintained; anti-siphonage or backflow prevention device not provided where required; equipment or floor not properly drained; sewage disposal system in disrepair or not functioning properly.
## 9 Filth flies or food/refuse/sewage-associated (FRSA) flies present in facility\032s food and/or non-food areas. Filth flies include house flies, little house flies, blow flies, bottle flies and flesh flies. Food/refuse/sewage-associated flies include fruit flies, drain flies and Phorid flies.
## 10 Food Protection Certificate not held by supervisor of food operations.
Next, we decide to explore the violation types the Italian restaurants have made. Comparing all three plots, it seems that the restaurants of different types tend to make similar violations. The 10 most common violation types for Italian restaurants are exactly same as the American restaurants except that the orders of violation codes (in terms of frequency) changes. The Italian restaurants in New York area seem to violate the rule associated with installing and maintaining plumbing properly more than the American restaurants in New York. Moreover, compared to the overall frequency of violation codes, these two types of restaurants seem to have more issue with mice.
violation_code_with_restype("Chinese")
## Selecting by count
## VIOLATION.CODE
## 1 10F
## 2 08A
## 3 06C
## 4 04L
## 5 02B
## 6 02G
## 7 06D
## 8 10B
## 9 04M
## 10 04N
## VIOLATION.DESCRIPTION
## 1 Non-food contact surface improperly constructed. Unacceptable material used. Non-food contact surface or equipment improperly maintained and/or not properly sealed, raised, spaced or movable to allow accessibility for cleaning on all sides, above and underneath the unit.
## 2 Facility not vermin proof. Harborage or conditions conducive to attracting vermin to the premises and/or allowing vermin to exist.
## 3 Food not protected from potential source of contamination during storage, preparation, transportation, display or service.
## 4 Evidence of mice or live mice present in facility's food and/or non-food areas.
## 5 Hot food item not held at or above 140Ã\u0082º F.
## 6 Cold food item held above 41Ã\u0082º F (smoked fish and reduced oxygen packaged foods above 38 Ã\u0082ºF) except during necessary preparation.
## 7 Food contact surface not properly washed, rinsed and sanitized after each use and following any activity when contamination may have occurred.
## 8 Plumbing not properly installed or maintained; anti-siphonage or backflow prevention device not provided where required; equipment or floor not properly drained; sewage disposal system in disrepair or not functioning properly.
## 9 Live roaches present in facility's food and/or non-food areas.
## 10 Filth flies or food/refuse/sewage-associated (FRSA) flies present in facility\032s food and/or non-food areas. Filth flies include house flies, little house flies, blow flies, bottle flies and flesh flies. Food/refuse/sewage-associated flies include fruit flies, drain flies and Phorid flies.
The 10 most common violation types for Chinese restaurants seem to be slightly different from those of the previous two restaurant types. The violation of 04L (Evidence of mice or live mice present in facility’s food and/or non-food areas) is more frequent (in terms of relative frequency). Moreover, the 04M code is violated more than 1000 times and is ranked at top 9. This one is associated with live roaches present in facility’s food and/or non-food areas. So far, our results on the different types of restaurants show us that it would probably be a great idea to look at this report first to avoid the restaurants with roaches and mice.
violation_code_with_restype("Dessert")
## Selecting by count
## VIOLATION.CODE
## 1 10F
## 2 08A
## 3 04L
## 4 06D
## 5 06C
## 6 10B
## 7 04A
## 8 06E
## 9 04N
## 10 04H
## VIOLATION.DESCRIPTION
## 1 Non-food contact surface improperly constructed. Unacceptable material used. Non-food contact surface or equipment improperly maintained and/or not properly sealed, raised, spaced or movable to allow accessibility for cleaning on all sides, above and underneath the unit.
## 2 Facility not vermin proof. Harborage or conditions conducive to attracting vermin to the premises and/or allowing vermin to exist.
## 3 Evidence of mice or live mice present in facility's food and/or non-food areas.
## 4 Food contact surface not properly washed, rinsed and sanitized after each use and following any activity when contamination may have occurred.
## 5 Food not protected from potential source of contamination during storage, preparation, transportation, display or service.
## 6 Plumbing not properly installed or maintained; anti-siphonage or backflow prevention device not provided where required; equipment or floor not properly drained; sewage disposal system in disrepair or not functioning properly.
## 7 Food Protection Certificate not held by supervisor of food operations.
## 8 Sanitized equipment or utensil, including in-use food dispensing utensil, improperly used or stored.
## 9 Filth flies or food/refuse/sewage-associated (FRSA) flies present in facility\032s food and/or non-food areas. Filth flies include house flies, little house flies, blow flies, bottle flies and flesh flies. Food/refuse/sewage-associated flies include fruit flies, drain flies and Phorid flies.
## 10 Raw, cooked or prepared food is adulterated, contaminated, cross-contaminated, or not discarded in accordance with HACCP plan.
violation_code_with_restype("Latin (Cuban, Dominican, Puerto Rican, South & Central American)")
## Selecting by count
## VIOLATION.CODE
## 1 10F
## 2 08A
## 3 02B
## 4 04L
## 5 06C
## 6 04N
## 7 02G
## 8 10B
## 9 06D
## 10 04M
## VIOLATION.DESCRIPTION
## 1 Non-food contact surface improperly constructed. Unacceptable material used. Non-food contact surface or equipment improperly maintained and/or not properly sealed, raised, spaced or movable to allow accessibility for cleaning on all sides, above and underneath the unit.
## 2 Facility not vermin proof. Harborage or conditions conducive to attracting vermin to the premises and/or allowing vermin to exist.
## 3 Hot food item not held at or above 140Ã\u0082º F.
## 4 Evidence of mice or live mice present in facility's food and/or non-food areas.
## 5 Food not protected from potential source of contamination during storage, preparation, transportation, display or service.
## 6 Filth flies or food/refuse/sewage-associated (FRSA) flies present in facility\032s food and/or non-food areas. Filth flies include house flies, little house flies, blow flies, bottle flies and flesh flies. Food/refuse/sewage-associated flies include fruit flies, drain flies and Phorid flies.
## 7 Cold food item held above 41Ã\u0082º F (smoked fish and reduced oxygen packaged foods above 38 Ã\u0082ºF) except during necessary preparation.
## 8 Plumbing not properly installed or maintained; anti-siphonage or backflow prevention device not provided where required; equipment or floor not properly drained; sewage disposal system in disrepair or not functioning properly.
## 9 Food contact surface not properly washed, rinsed and sanitized after each use and following any activity when contamination may have occurred.
## 10 Live roaches present in facility's food and/or non-food areas.
The bar chart of the violation codes for Latin and Dessert restaurants look almost the same as other three types of restaurants. The violation associated with the evidence of mice or live mice present in facility’s food and/or non-food areas have frequently made in both types of restaurants compared to other violations. Moreover, Latin restaurants have the issue of roaches as well. The Dessert type restaurants tend to violate the code associated with sanitizing utensils and washing food contact surface more often. Though, compared to the overall (containing all the restaurants) frequency of the violation codes, the code associated with the mice one is ranked lower for these types of restaurants.
Also, we decide to study the association between the violation codes and the years. In other words, we decide to explore more to see if there are any trends in the types of violations across the years. We still focus on the 5 most common restaurants to keep it simple and more interpretable. In order to produce a plot for each year, we take the same steps as described above except that we create new columns called year and month which extract year and month information from inspection date using mutate function.
Additional note: we decide to focus on 08A, 06D, 04L, and 06C only. While including all the violation codes provide more information about what overall violation trends across the year, this can make the graph more complicated and hard to look at. So, to make it simple and more readable, we choose 08A, 06D, 04L, and 06C as our x variable to see if there are any trends in these violation types. Again, because we are most interested in the shape of the distribution for each type of restaurant and each year, we set the scale to free to prevent some of the graphs from shrinking towards 0.
violation_type <- c("08A", "06D", "04L", "06C")
top5_year <- top5 %>% filter(violation_code %in% violation_type) %>% mutate(year = factor(format(inspection_date, "%Y")), month = factor(format(inspection_date, "%m"))) %>% select(cuisine, year, violation_code) %>% group_by(cuisine, year, violation_code) %>% summarise(count = n())
ggplot(top5_year, aes(x = violation_code, y = count, fill = year)) +
geom_col(position = "dodge") +
facet_grid(cuisine~., scale = "free") +
ggtitle("Trends in frequency of violation types") +
theme_grey(16)
One noticeable feature from this graph is that there are not much data in 2013. In fact, it is also shown in the dataframe that the frequency of any violation types for 2013 is less than 10. The frequency of 08A is highest in the year of 2017 for American, Dessert, Chinese, and Italian. Only the plot of Latin restaurants has the highest frequency of 08A in the year of 2015. There is increasing pattern in any violation code for Chinese and Dessert restaurants. All of the violation code reached its peak in the year of 2017. It then decreases significantly at 2018, but it would probably be because the inspections for 2018 has not finished yet (since it is just April). While there is an increasing pattern in 04L, 06D, and 08A for Italian and American restaurants, it seems like the frequency of the violations of 06C (Food not protected from potential source of contamination during storage, preparation, transportation, display or service) has decreased after 2016. For Latin restaurants, the frequency of all four type of violations increase at 2015 and then decrease after 2016. Also, the Latin restaurants are the only one which has a decreasing trend in 04L that is associated with mice.
print(worstOfList <- arrange(filter(shrunkenData, GRADE %in% grades), desc(SCORE))[1:20,c(2,10)])
## DBA SCORE
## 1 FERRY HOUSE CAFE 94
## 2 MOTI MAHAL DELUX 83
## 3 PHAYUL RESTAURANT 80
## 4 KUM GANG SAN 74
## 5 SABOR DE CUBA 73
## 6 YOLANDA RESTAURANT 73
## 7 LOS POLLITOS III 73
## 8 FRESCO TORTILLAS 73
## 9 ROADSIDE CHICKEN & SEAFOOD 72
## 10 PAN UGO BAKERY 66
## 11 SAVOUR SICHUAN 65
## 12 THAI MAX 65
## 13 MASAKI TERIYAKI & SUSHI 64
## 14 MING KEE KITCHEN 64
## 15 INTERMEZZO 64
## 16 HESTIA MARKETPLACE 64
## 17 EXPRESS BROOK LUNCH RESTAURANT 63
## 18 BRICCO RESTAURANT 63
## 19 COFFEED 63
## 20 XING WONG BBQ 63
arrange(filter(shrunkenData, GRADE %in% grades), desc(SCORE))[1:20,c(2,10)] # select cols and rows # particular rest violate the most
## DBA SCORE
## 1 FERRY HOUSE CAFE 94
## 2 MOTI MAHAL DELUX 83
## 3 PHAYUL RESTAURANT 80
## 4 KUM GANG SAN 74
## 5 SABOR DE CUBA 73
## 6 YOLANDA RESTAURANT 73
## 7 LOS POLLITOS III 73
## 8 FRESCO TORTILLAS 73
## 9 ROADSIDE CHICKEN & SEAFOOD 72
## 10 PAN UGO BAKERY 66
## 11 SAVOUR SICHUAN 65
## 12 THAI MAX 65
## 13 MASAKI TERIYAKI & SUSHI 64
## 14 MING KEE KITCHEN 64
## 15 INTERMEZZO 64
## 16 HESTIA MARKETPLACE 64
## 17 EXPRESS BROOK LUNCH RESTAURANT 63
## 18 BRICCO RESTAURANT 63
## 19 COFFEED 63
## 20 XING WONG BBQ 63
ggplot(worstOfList, aes(reorder(DBA,SCORE), SCORE)) +
geom_bar(stat = 'identity', aes(fill = SCORE)) +
theme(text = element_text(size=10),
axis.text.x = element_text(angle=45, vjust=1, color = "grey")) +
geom_hline(yintercept = 14, color = "white") +
geom_hline(yintercept = 27, color = "yellow") +
labs(title="Worst of the Worst")
# Top 10 Resteraurants Ranked by Number of Critical Violation
# Regardless of type of violation
# group data
# View(inspecttbl)
restaurant <- inspecttbl%>%group_by(INSPECTION.DATE, DBA, SCORE, GRADE, CRITICAL.FLAG,BORO, BUILDING, STREET, ZIPCODE)
# renaming DBA(doing business as usual) to restaurant
restaurant<- restaurant%>%rename( RESTAURANT = DBA)
# Count voilations according to restaurant and type of violation [~distinct] count the instances regardless of violation type per score, per restaurant, per date
count<-restaurant%>%summarise(count=n())
# select only critical violation
citical_count<-count%>%filter(CRITICAL.FLAG=='Critical')
# Find the worest restaurants by calculating the number of that violate with Grade C under critical situation
citical_count<-citical_count%>%filter(SCORE>27)%>%group_by(RESTAURANT,STREET)%>%mutate(num_critical_vio=sum(count))
# order the data # restaurant the receives violates grade C most
res_critical_vio_count<-distinct(citical_count%>%select(STREET,RESTAURANT,num_critical_vio)%>%arrange(desc(num_critical_vio)))
ggplot(res_critical_vio_count[1:10,c(2,3)], aes(reorder(RESTAURANT,num_critical_vio), num_critical_vio)) +
geom_bar(stat = 'identity',fill="tan2") +
coord_flip()+
#xlabel,ylabel
labs(title="10 Worest Resteraurants Ranked by Number of Critical Violation ")
# restaurant have the highest violation score
# citical_count_score
# Define bad restaurants as violation Score > 27 [Grade C] ~ 5 occurance of Grade C
citical_count_score<-count%>%
filter(CRITICAL.FLAG=='Critical',SCORE>27,count>5)%>%
arrange(desc(SCORE))
result<-citical_count_score[1:10,c(2,3,10)]
# save the csv
# csv<-citical_count_score
# write.csv(csv[1:15,c(1,4,2,3,10)], file = '/Users/Selina/Desktop/Visualization/EDAV-master/chart.csv',row.names = FALSE)
ggplot(result, aes(reorder(RESTAURANT,SCORE), SCORE)) +
geom_bar(stat = 'identity',fill="tan2") +
coord_flip()+
#theme(text = element_text(size=10), axis.text.x = element_text(angle=45, vjust=1, color = "grey")) +
labs(title="Worest 10 Highest Violation Score Restaurant ")
Chain Restuarant
#### INSPECTION.TYPE Violation related####
# we focus on instore sanity check of the inspection#
#extra levels(as.factor(inspecttbl1$CUISINE.DESCRIPTION))
#inspecttbl%>%filter(DBA =="STARBUCKS")
# dis-regarding rows from initial inspections as grades and scores are not counted
STARBUCKS<-inspecttbl%>%filter(DBA=="STARBUCKS", GRADE!='A', INSPECTION.TYPE!='Cycle Inspection / Initial Inspection', CRITICAL.FLAG=='Critical')
# critical score and inspection type of chain restaurant
#inspecttbl1%>%filter(DBA=="STARBUCKS")%>%group_by(DBA,INSPECTION.DATE,BORO)%>%summarise(mean(SCORE))
# store per year average critical score variation
cafe<-inspecttbl1%>%filter(CUISINE.DESCRIPTION=="Café/Coffee/Tea")%>%mutate(year = format(INSPECTION.DATE, "%Y"))#%>%group_by(DBA,year)%>%summarise(mean(SCORE))
STARBUCKS<-cafe[grep("STARBUCKS",cafe$DBA),]%>%group_by(year,BORO)%>%summarise(MEAN = mean(SCORE))
STARBUCKS$DBA<-"STARBUCKS"
DUNKIN<-cafe[grep("DUNKIN*",cafe$DBA),]%>%group_by(year,BORO)%>%summarise(MEAN = mean(SCORE))
DUNKIN$DBA<-"DUNKIN"
PICCOLO<-cafe[grep("PICCOLO*",cafe$DBA),]%>%group_by(year,BORO)%>%summarise(MEAN = mean(SCORE))
PICCOLO$DBA<-"PICCOLO"
GREGORY<-cafe[grep("GREGORY'S COFFEE*",cafe$DBA),]%>%group_by(year,BORO)%>%summarise(MEAN = mean(SCORE))
GREGORY$DBA<-"GREGORY"
score<-rbind(STARBUCKS,DUNKIN,GREGORY,PICCOLO)
score_boro<-score%>%filter(BORO=="MANHATTAN"|BORO=="QUEENS"|BORO=="BROOKLYN")
###########Ranked Plot of Cafe###############
ggplot(score_boro, aes(x = year, y = MEAN, fill = DBA)) +
geom_col(position = "dodge") +
facet_grid(fct_relevel(score_boro$BORO,"MANHATTAN")~.)+
ggtitle("Mean Violation Score") +
theme_grey(16)
# CAFE common violation type
STARBUCKS_type<-inspecttbl1[ grep("STARBUCKS",inspecttbl1$DBA),]%>%group_by(DBA,VIOLATION.CODE)%>%summarise(vio_freq=n()/2431)%>%arrange(desc(vio_freq))
STARBUCKS_type$DBA<-"STARBUCKS"
DUNKIN_type<-inspecttbl1[ grep("DUNKIN*",inspecttbl1$DBA),]%>%group_by(DBA,VIOLATION.CODE)%>%summarise(vio_freq=n()/6004)%>%arrange(desc(vio_freq))
DUNKIN_type$DBA<-"DUNKIN"
PICCOLO_type<-inspecttbl1[ grep("PICCOLO*",inspecttbl1$DBA),]%>%group_by(DBA,VIOLATION.CODE)%>%summarise(vio_freq=n()/339)%>%arrange(desc(vio_freq))
PICCOLO_type$DBA<-"PICCOLO"
GREGORY_type<-inspecttbl1[ grep("GREGORY'S COFFEE*",inspecttbl1$DBA),]%>%group_by(DBA,VIOLATION.CODE)%>%summarise(vio_freq=n()/98)%>%arrange(desc(vio_freq))
GREGORY_type$DBA<-"GREGORY"
type<-rbind(STARBUCKS_type[1:5,], DUNKIN_type[1:5,],PICCOLO_type[1:5,],GREGORY_type[1:5,])
# Change to base on violation code ~ instead of top violation freq.
ggplot(type, aes(x = VIOLATION.CODE, y = vio_freq, fill = DBA)) +
geom_col(position = "dodge") +
ggtitle("Most Freq Violation Code") +
theme_grey(16)
type$DBA<-factor(type$DBA)
ggplot(type, aes(x = VIOLATION.CODE, y = vio_freq)) +
geom_col(position = "dodge",fill="tan2") +
facet_grid(~ fct_relevel(type$DBA,"STARBUCKS"),scales = 'free')+
ggtitle("Most Freq Violation Code") +
theme_grey(16)
# View(distinct(inspecttbl1,VIOLATION.DESCRIPTION,VIOLATION.CODE))
# 08A vermin
# 04N flies
# select code c('10F','10B','08A')
code<-inspecttbl1%>%filter(VIOLATION.CODE=='10F'|VIOLATION.CODE=='10B'|VIOLATION.CODE=='08A')%>%mutate(year = format(INSPECTION.DATE, "%Y"),month = format(INSPECTION.DATE, "%Y"))
STARBUCKS_type<-code[ grep("STARBUCKS",code$DBA),]%>%group_by(DBA,VIOLATION.CODE)%>%summarise(vio_freq=n())%>%arrange(desc(vio_freq))
STARBUCKS_type$DBA<-"STARBUCKS"
DUNKIN_type<-code[ grep("DUNKIN*",code$DBA),]%>%group_by(DBA,VIOLATION.CODE)%>%summarise(vio_freq=n())%>%arrange(desc(vio_freq))
DUNKIN_type$DBA<-"DUNKIN"
PICCOLO_type<-code[ grep("PICCOLO*",code$DBA),]%>%group_by(DBA,VIOLATION.CODE)%>%summarise(vio_freq=n())%>%arrange(desc(vio_freq))
PICCOLO_type$DBA<-"PICCOLO"
GREGORY_type<-code[ grep("GREGORY'S COFFEE*",code$DBA),]%>%group_by(DBA,VIOLATION.CODE)%>%summarise(vio_freq=n())%>%arrange(desc(vio_freq))
GREGORY_type$DBA<-"GREGORY"
code<-rbind(STARBUCKS_type[1:5,], DUNKIN_type[1:5,],PICCOLO_type[1:5,],GREGORY_type[1:5,])
code$DBA<-factor(code$DBA)
code$DBA<-fct_relevel(code$DBA,"STARBUCKS")
mosaic(VIOLATION.CODE ~ DBA, code, direction = c("v", "h"),labeling= labeling_border(rot_labels = c(15,0,0,0))) #gp = gpar(fill = c("blue", "lightblue"),rot_labels=c(0,90,0,0),just_labels="right" )
#, labeling= labeling_border(rot_labels = c(0,90,0,0),
# just_labels = c("center",
# "center",
# "center",
# "right"))
# per year/date
CAFE<-inspecttbl1%>%filter(CUISINE.DESCRIPTION=="Café/Coffee/Tea")%>%mutate(year = format(INSPECTION.DATE, "%Y"),month = format(INSPECTION.DATE, "%m"))
STARBUCKS<-CAFE[grep("STARBUCKS",CAFE$DBA),]%>%group_by(year,month)%>%summarise(MEAN = mean(SCORE))
STARBUCKS<-STARBUCKS%>%filter(year!="2018",year!="2014")
#STARBUCKS
######TRY QUaRTERly SCORE # group by quater of the year
ggplot(STARBUCKS, aes(as.numeric(month), MEAN)) + geom_line() +#color = symbol
ggtitle("STARBUCKS VIOLATION SCORE") +
labs (x = "", y = "SCORE") +
facet_grid(~year)+
theme_grey(16) +
theme(legend.title = element_blank())
# select 08A meanscore with date
STARBUCKS<-CAFE[grep("STARBUCKS",CAFE$DBA),]%>%group_by(year,month)%>%filter(VIOLATION.CODE=="08A")%>%summarise(SCORE08A = mean(SCORE))
STARBUCKS<-STARBUCKS%>%filter(year!="2018",year!="2014")
#STARBUCKS
ggplot(STARBUCKS, aes(as.numeric(month), SCORE08A)) + geom_line() +
ggtitle("STARBUCKS VIOLATION SCORE") +
labs (x = "", y = "SCORE") +
facet_grid(~year)+
theme_grey(16) +
scale_x_discrete(name ="Month",
limits=seq(1, 12, 1))+
theme(legend.title = element_blank())
###########alternative#######
yearmonth<-inspecttbl1%>%filter(CUISINE.DESCRIPTION=="Café/Coffee/Tea")%>%mutate(yearmonth = format(INSPECTION.DATE, "%Y-%m"))
STARBUCKS<-yearmonth[grep("STARBUCKS",yearmonth$DBA),]%>%group_by(yearmonth)%>%filter(VIOLATION.CODE=="08A")%>%summarise(SCORE08A = mean(SCORE))
#STARBUCKS<-STARBUCKS%>%filter(year!="2018",year!="2014")
#STARBUCKS
#as.Date(as.yearmon(STARBUCKS$yearmonth, "%Y-%m"))
p<-ggplot(STARBUCKS, aes(as.Date(as.yearmon(STARBUCKS$yearmonth, "%Y-%m")), SCORE08A)) + geom_line() +
geom_point()+
ggtitle("STARBUCKS VIOLATION SCORE") +
labs (x = "", y = "SCORE") +
# scale_x_discrete(name ="Month",
# limits=seq(1, 12, 1))+
#facet_grid(~year)+
theme_grey(16) +
theme(legend.title = element_blank())
p <- ggplotly(p)
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`
p
# analysis of area
STARBUCKS<-CAFE[grep("STARBUCKS",CAFE$DBA),]%>%group_by(BORO,year)%>%filter(VIOLATION.CODE=="08A")%>%summarise(SCORE08A = mean(SCORE))
STARBUCKS<-STARBUCKS%>%filter(year!="2018",year!="2014")
ggplot(STARBUCKS, aes(reorder(BORO,SCORE08A), SCORE08A)) +
geom_bar(stat = 'identity',fill="tan2") +
facet_grid(~year)+
#coord_flip()+
labs(title="Mean 08A By BORO ",x = "NYC Borough", y = "08A Violation Score")+
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))
(Provide a short nontechnical summary of the most revealing findings of your analysis written for a nontechnical audience. The length should be approximately two pages (if we were using pages…) Take extra care to clean up your graphs, ensuring that best practices for presentation are followed.)